Once upon a time …

There was a forest full of forked roads, trees, hedges, and people too. Thus a story begins with the situation, a context. That’s the place where data resides. But life gets complicated as the hedges are cut down, people must make decisions with trade-offs about the roads and trees, and the weather turns bad. Questions arise, the environment and people respond. We have a very (generalized!) story brewing here.

Multilevel models are story-telling devices in their own right, and every good story keeps the flow, the data, the characters, and what the characters do to one another as simple as possible. We pool data and shrink the number of parameters to build models both amnesic and a little agnostic. We will provide the extra memory and the causal cautionary tales. One of the most important characteristics of any story is that one segment leaves you panting for the next.

Pooling of information across aspects, features, segments of a story will help us understand and participate in the whole story. No pooling often leaves us wondering what our names are after hearing a story, let alone what the story is about. Full pooling leaves nothing to the imagination. Partial pooling allows story components to share enough information to bring all of the story together into a unity.

Imagine a cluster of customer observations from different markets. In this context pooling means using the information from other markets to inform our estimates for each market. A model with no pooling means that each market is the first market that we have ever seen, as other markets have no effect on our estimates. Models without pooling have amnesia, if they ever remembered at all.

As a consequence, pooling produces shrinkage of parameter estimates: each market's ensemble of estimates gravitates, with a centripetal force of attraction, toward the inertial center of the global mean across markets. Nice concepts, but how do multilevel models do this? Do small-sample markets share more or less information than large-sample markets? We have much to ponder.
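To sharpen the intuition about shrinkage, consider the standard normal-normal approximation (a textbook sketch, not the exact computation for the binomial model used below). Each market's partially pooled estimate is a precision-weighted compromise between its own no-pooling estimate and the global mean:

\[ \begin{align} \hat{\alpha}_j &\approx \lambda_j \, \hat{\alpha}_j^{\text{no pool}} + (1 - \lambda_j) \, \overline{\alpha}, \\ \lambda_j &= \frac{\sigma^2}{\sigma^2 + \sigma^2_{\text{within}} / n_j} \end{align} \]

where \(\sigma^2\) is the between-market variance, \(\sigma^2_{\text{within}}\) the within-market variance, and \(n_j\) the market's sample size. A small \(n_j\) pushes \(\lambda_j\) toward zero and the estimate toward \(\overline{\alpha}\): markets with little data shrink the most, and a small estimated \(\sigma\) shrinks every market harder.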

Parameters from a common distribution

Simulations allow us to know the true, so-called population, parameters of the variates simply because we set them up that way. Because we are in charge of the data, we can check whether the model actually recovers the parameters we designed.

In a multilevel model, the parameters for each market derive from the same pool: a common statistical population of parameters. An example is the family of varying intercepts, one for each market in the pool. As the model samples hypotheses for each market's parameter, that parameter learns about the sampling of the other parameters in the other markets of the same pool.

The sampling of each parameter complements the sampling, and so the cross-market learning, of all the other parameters in the other markets. We thus have an adaptively regularized family of parameters. It is adaptive in that each parameter learns from the sampling, the learning, that happens in the other parameters. It is regularized in that the parameters effectively shrink in number and in size toward the pool from which they are sampled. The amount of shrinkage is inversely related to the estimated variation, compatible with the data, of the distribution of the family of parameters: the smaller the estimated variation, the stronger the pull. The most influenced parameters will be those from markets with small sample sizes.

However, it is one thing to have some intuition and another to really understand something. When it comes to statistics, I am a big fan of simulation. Thankfully, Richard does precisely this in chapter 12. Let’s simulate a model to visualize both pooling and shrinking.

The model: A multilevel binomial

We simulate the number of customers who transacted for some good at each of several markets. That is, each market \(i\) has \(S_i\) customers who transacted, out of a maximum of \(N_i\). The model then is the following:

\[ \begin{align} S_i &\sim Binomial(N_i, p_i), \\ logit(p_i) &= \alpha_{market_{[i]}}, \\ \alpha_j &\sim Normal(\overline{\alpha}, \sigma), \\ \overline{\alpha} &\sim Normal(0, 1.5), \\ \sigma &\sim Exponential(1) \end{align} \]

We know that we could also have a volatility model involved here. Such a specification would thicken or thin the posterior distribution tails. But we leave that for another simulation.

We have customers in markets. What do customers do? They visit the market and they sometimes transact for goods and services. We assume a distribution for the average log-odds of actually transacting for each market.

\[ \alpha_j \sim Normal(\bar{\alpha}, \sigma). \]

In this simulation the prior for each intercept will be a distribution that we will simultaneously learn as we learn the individual parameters. We have hyper-priors: priors for the parameters of the distribution of intercepts (\(\overline{\alpha}, \sigma\)).

Quite a lot to pull into the analysis. So an analyst must:

  • Not only state assumptions in priors

  • But also, state further, layered or multi-level, assumptions about a distribution-reverting mechanism

Is this too much for an analyst to do? We think not.

The simulation

To simulate this model, we define the parameters of the distribution of intercepts. For each market, we will simulate an average log-odds of transacting. Using the average log-odds of transacting we will simulate the number of customers in each market who transacted using the binomial bag of beans.

Epistemology answers the question: what is it to know? Our answer is some sort of probable inference. Ontology answers the question: given we know, what is it that we know? Neither hyper-parameters nor parameters are in this schema. In a sense, only that which is experienced, observed, sensed, is in the simulation of variates. Hyper-parameters, parameters, and so on are considered by some to be figments of the furtive imaginations of analysts who make decisions about the relationships of parameters, otherwise known as a model. The model has a purpose, materials, a design, and an agency. It also possesses something of an objectivity, which in our case is the setting up of the data to test the model that emanates from the furtive analytical imagination.

In this way we begin by setting the parameters of the population of intercepts.

a_bar <- 1.5
sigma <- 1.5
n_markets <- 50
# customers per market
N_i <- as.integer(rep(c(5, 10, 25, 35, 40), each = 10))

With these parameters we can now simulate the average log-odds of transacting for the good (or service) for each of the markets by using, for example, a univariate normal random number.

avg_log_odds_per_market <- rnorm(n_markets, mean = a_bar, sd = sigma)

Very Gaussian of us indeed. Perhaps we should try a Pareto power distribution with thresholds against which customers might transact, or not. But again we leave this idea, this thought experiment, for another time.

We now have the following simulation of data about the markets and the customers within markets. Their only claim to fame is their transactions.

data_simulation <- data.frame(
  market = 1:n_markets, 
  N_i = N_i, 
  true_log_odds = avg_log_odds_per_market)
glimpse(data_simulation)
glimpse(data_simulation)
## Rows: 50
## Columns: 3
## $ market        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ N_i           <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 10, 10, 10, 10, 10, 10,...
## $ true_log_odds <dbl> 3.5564377, 0.6529527, 2.0446926, 2.4492939, 2.1064025...

The prefix true_ refers to the data we have named, to the best of our ability, as the actual, true data: the imposed objectivity of this simulation.

Simulate the transactors

We transform the log-odds, via the logistic function, into the probability parameter of the binomial distribution. With this prestidigitation we can generate the number of customers who transacted, by market.

In modeling parlance, we know that the logistic is the inverse of the logit. We thus transform log-odds into probability.
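As a quick sanity check, the round trip can be written out directly. A minimal base-R sketch (the rethinking package supplies equivalent logistic() and inv_logit() helpers, which the chunks below use):

```r
logit    <- function(p) log(p / (1 - p))   # probability -> log-odds
logistic <- function(z) 1 / (1 + exp(-z))  # log-odds -> probability (the inverse)

p <- 0.8
z <- logit(p)   # log(0.8 / 0.2) = log(4), about 1.386
logistic(z)     # recovers 0.8
```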

data_simulation <- data_simulation %>% 
  mutate(
    number_transacted = rbinom(n_markets, 
                          prob = logistic(true_log_odds), 
                          size = N_i)
    )
glimpse(data_simulation)
## Rows: 50
## Columns: 4
## $ market            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ N_i               <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 10, 10, 10, 10, 10,...
## $ true_log_odds     <dbl> 3.5564377, 0.6529527, 2.0446926, 2.4492939, 2.106...
## $ number_transacted <int> 5, 4, 5, 5, 3, 2, 5, 3, 5, 5, 10, 10, 5, 8, 8, 9,...
plt <- data_simulation %>% 
  ggplot( aes(market, number_transacted, color = N_i) ) +
  geom_point() +
  scale_color_viridis_c() +
  labs( title = "Simulated customers who transacted per market",
       color = "Initial #" )
ggplotly( plt )

Does this look reasonable? Yes, as we eyeball the various market groupings of transacting customers. More customers transact in the later (larger) markets than in the earlier (smaller) ones. We will be able to answer one of our questions about pooling and shrinkage: does market sample size matter?

To pool, or not to pool, that is the question

We have three possibilities.

  1. Do not pool at all. This is a model that has complete amnesia. It will not convey information across parameters.

  2. Partially pool. Allow some pooling up to a point of perhaps overloading the model’s fragile memory. We will allow some fuzziness to occur. After all don’t our simulated transactors wander in and out of each other’s markets?

  3. Pool everything. Hold on, we will know everything about everything? Does data, thinking about data with a purpose, acting on data to achieve a result live in this storied environment?
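The three options can be sketched on toy counts before we estimate anything. The numbers here are made up, and the partial-pooling line uses a fixed pseudo-count shrinkage toward 0.5 only as a crude stand-in for the adaptive multilevel estimate we fit below:

```r
k <- c(5, 4, 18)    # customers who transacted, per market (made-up)
n <- c(5, 10, 25)   # customers observed, per market (made-up)

no_pool      <- k / n                    # 1. each market on its own
partial_pool <- (k + 2) / (n + 4)        # 2. crude shrinkage via pseudo-counts
full_pool    <- rep(sum(k) / sum(n), 3)  # 3. one shared rate for every market

rbind(no_pool, partial_pool, full_pool)
```

In this toy example the partial-pooling estimate lands between each market's raw proportion and the shared rate, and it moves the smallest market (n = 5) by far the most.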

No-pooling estimates

Pooling means using the information from other markets to inform our predictions of the estimated probability of customers transacting in each market. Therefore, no pooling means treating each market as completely unrelated to the others. This is equivalent to estimating the variance of the population of parameters to be infinite.

Therefore, our estimate of the probability of transacting in each market will just be the raw sample proportion of customers transacting in each market.

data_simulation <- data_simulation %>% 
  mutate(estimated_probability_no_pooling = number_transacted / N_i)
glimpse(data_simulation)
## Rows: 50
## Columns: 5
## $ market                           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,...
## $ N_i                              <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 10, ...
## $ true_log_odds                    <dbl> 3.5564377, 0.6529527, 2.0446926, 2...
## $ number_transacted                <int> 5, 4, 5, 5, 3, 2, 5, 3, 5, 5, 10, ...
## $ estimated_probability_no_pooling <dbl> 1.00, 0.80, 1.00, 1.00, 0.60, 0.40...

The lines are clearly drawn, and uninformative about relationships among markets.

Partial pooling estimates

Partial pooling means modeling the population of parameters explicitly. With the estimation of a mean and a standard deviation, we can regularize adaptively, and this action will shrink the parameter estimates that enter our predictions. Thus we fit a multilevel binomial model of transacting customers in markets.

# trimmed data for estimation in ulam() and stan
data_model <- list(
  S_i = data_simulation$number_transacted, 
  N_i = data_simulation$N_i, 
  market = data_simulation$market
  )
#
multilevel_model <- alist(
  S_i ~ dbinom(N_i, p),
  logit(p) <- a_market[market], # each market will have its own average log odds of transacting
  a_market[market] ~ dnorm(a_bar, sigma),
  a_bar ~ dnorm(0, 1.5),
  sigma ~ dexp(1)
)

Then we use ulam() and Stan to derive the posterior distribution using Hamiltonian Monte Carlo with the No-U-Turn Sampler (NUTS).

multilevel_fit <- ulam(
  multilevel_model, 
  data = data_model, 
  chains = 1, log_lik = TRUE
  )
## 
## SAMPLING FOR MODEL '48b25e8658507514b9821a7920cf7938' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%]  (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%]  (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%]  (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%]  (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%]  (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%]  (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%]  (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%]  (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%]  (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%]  (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.317 seconds (Warm-up)
## Chain 1:                0.213 seconds (Sampling)
## Chain 1:                0.53 seconds (Total)
## Chain 1:
precis( multilevel_fit )
##           mean        sd      5.5%    94.5%    n_eff     Rhat4
## a_bar 1.347924 0.2547422 0.9840563 1.766829 645.6942 0.9980424
## sigma 1.615880 0.2537554 1.2578163 2.038842 277.1777 1.0081819

For one chain, this estimation pulls the parameter estimates fairly close, within a standard deviation, to the 1.5 values we chose for the simulation. The model so far seems compatible with the data. We might go so far as to say that the model produces estimates for the mean a_bar between 0.98 and 1.77 across about 89% of the ways that are compatible with our simulated data.

We can evaluate the health of our Markov chains using the following traceplots. If we were to run more than one chain, we would likely conclude that the chains mix well for the two population parameters, that they are fairly stable, and that they converge to nearly the same estimates.

traceplot_ulam(multilevel_fit)

Now, let’s find out how well the Markov chain Monte Carlo estimates converged using Rhat4, a measure of convergence. Values near 1 indicate that the MCMC estimates converged.

precis(multilevel_fit, depth = 2) %>% 
  data.frame() %>% 
  select(Rhat4) %>% 
  summary()
##      Rhat4       
##  Min.   :0.9980  
##  1st Qu.:0.9981  
##  Median :0.9984  
##  Mean   :0.9992  
##  3rd Qu.:0.9995  
##  Max.   :1.0082

The Rhat values indicate that we sampled correctly from our posterior distributions. Let’s use these samples from the posterior distribution to calculate our estimated log-odds of transacting for each market.

posterior_samples <- extract.samples( multilevel_fit )
glimpse( posterior_samples )
## List of 3
##  $ a_market: num [1:500, 1:50] 2.453 2.239 0.539 2.036 1.489 ...
##  $ a_bar   : num [1:500(1d)] 1.31 1.52 1.44 1.23 1.78 ...
##  $ sigma   : num [1:500(1d)] 1.47 1.73 1.19 1.8 2.35 ...
##  - attr(*, "source")= chr "ulam posterior: 500 samples from object"
summary( posterior_samples$a_market )
##        V1                V2                V3                V4         
##  Min.   :-0.5103   Min.   :-0.7299   Min.   :-0.2144   Min.   :-0.7858  
##  1st Qu.: 1.8675   1st Qu.: 0.9432   1st Qu.: 1.8433   1st Qu.: 1.8150  
##  Median : 2.5627   Median : 1.4735   Median : 2.5150   Median : 2.6264  
##  Mean   : 2.6776   Mean   : 1.5207   Mean   : 2.7167   Mean   : 2.7077  
##  3rd Qu.: 3.4570   3rd Qu.: 2.0446   3rd Qu.: 3.5337   3rd Qu.: 3.4461  
##  Max.   : 6.9369   Max.   : 4.5156   Max.   : 7.1810   Max.   : 7.3637  
##        V5                V6                 V7              V8         
##  Min.   :-1.5217   Min.   :-2.57273   Min.   :0.148   Min.   :-2.3871  
##  1st Qu.: 0.1710   1st Qu.:-0.50475   1st Qu.:1.891   1st Qu.: 0.1323  
##  Median : 0.6937   Median : 0.03162   Median :2.577   Median : 0.6342  
##  Mean   : 0.7357   Mean   : 0.02944   Mean   :2.751   Mean   : 0.7044  
##  3rd Qu.: 1.2900   3rd Qu.: 0.54901   3rd Qu.:3.560   3rd Qu.: 1.2143  
##  Max.   : 3.0729   Max.   : 2.52856   Max.   :7.788   Max.   : 3.9193  
##        V9               V10                V11              V12        
##  Min.   :-0.0194   Min.   :-0.05306   Min.   :0.3511   Min.   :0.4774  
##  1st Qu.: 1.8346   1st Qu.: 1.84965   1st Qu.:2.2744   1st Qu.:2.3733  
##  Median : 2.5902   Median : 2.51379   Median :2.9991   Median :2.9889  
##  Mean   : 2.6794   Mean   : 2.68319   Mean   :3.1171   Mean   :3.0834  
##  3rd Qu.: 3.4116   3rd Qu.: 3.39019   3rd Qu.:3.7963   3rd Qu.:3.6593  
##  Max.   : 6.1589   Max.   : 6.88006   Max.   :7.8415   Max.   :8.2027  
##       V13               V14               V15               V16        
##  Min.   :-1.4966   Min.   :-0.2131   Min.   :-0.2055   Min.   :0.2798  
##  1st Qu.:-0.2213   1st Qu.: 1.0202   1st Qu.: 0.9217   1st Qu.:1.5365  
##  Median : 0.2107   Median : 1.4365   Median : 1.4498   Median :2.0481  
##  Mean   : 0.2362   Mean   : 1.5113   Mean   : 1.4734   Mean   :2.1180  
##  3rd Qu.: 0.6526   3rd Qu.: 1.9678   3rd Qu.: 1.9840   3rd Qu.:2.5873  
##  Max.   : 2.1107   Max.   : 3.6335   Max.   : 3.9021   Max.   :6.7987  
##       V17              V18                V19               V20         
##  Min.   :0.3127   Min.   :-5.46608   Min.   :-4.3787   Min.   :-0.2644  
##  1st Qu.:2.2409   1st Qu.:-2.64172   1st Qu.:-1.8291   1st Qu.: 1.5714  
##  Median :2.9347   Median :-2.08895   Median :-1.3750   Median : 2.1134  
##  Mean   :3.0002   Mean   :-2.12591   Mean   :-1.4255   Mean   : 2.1639  
##  3rd Qu.:3.5848   3rd Qu.:-1.49090   3rd Qu.:-0.9557   3rd Qu.: 2.7004  
##  Max.   :6.8233   Max.   :-0.03203   Max.   : 0.5073   Max.   : 5.2495  
##       V21               V22                V23               V24        
##  Min.   :0.01134   Min.   :-2.54460   Min.   :-0.5480   Min.   :0.8049  
##  1st Qu.:0.93116   1st Qu.:-1.48365   1st Qu.: 0.8966   1st Qu.:2.4145  
##  Median :1.18143   Median :-1.16581   Median : 1.2026   Median :2.9177  
##  Mean   :1.20967   Mean   :-1.18606   Mean   : 1.2185   Mean   :2.9391  
##  3rd Qu.:1.48541   3rd Qu.:-0.88348   3rd Qu.: 1.5257   3rd Qu.:3.4268  
##  Max.   :2.73146   Max.   : 0.05926   Max.   : 3.5885   Max.   :5.7931  
##       V25             V26               V27               V28         
##  Min.   :1.211   Min.   :-0.4323   Min.   :-0.7182   Min.   :-1.5583  
##  1st Qu.:2.963   1st Qu.: 0.7057   1st Qu.: 0.3917   1st Qu.:-0.6010  
##  Median :3.552   Median : 0.9976   Median : 0.6321   Median :-0.2963  
##  Mean   :3.647   Mean   : 1.0246   Mean   : 0.6612   Mean   :-0.3137  
##  3rd Qu.:4.274   3rd Qu.: 1.3350   3rd Qu.: 0.9198   3rd Qu.:-0.0279  
##  Max.   :7.789   Max.   : 2.2600   Max.   : 2.5197   Max.   : 0.7582  
##       V29              V30                V31              V32        
##  Min.   :0.9073   Min.   :-0.90707   Min.   :0.4196   Min.   :0.8413  
##  1st Qu.:2.3765   1st Qu.: 0.07084   1st Qu.:1.3177   1st Qu.:1.9556  
##  Median :2.8098   Median : 0.33579   Median :1.5898   Median :2.3274  
##  Mean   :2.8906   Mean   : 0.33268   Mean   :1.6390   Mean   :2.3297  
##  3rd Qu.:3.3481   3rd Qu.: 0.57211   3rd Qu.:1.9321   3rd Qu.:2.6798  
##  Max.   :5.6674   Max.   : 1.52168   Max.   :3.6829   Max.   :4.1580  
##       V33             V34               V35              V36         
##  Min.   :1.101   Min.   :-0.7895   Min.   :0.7633   Min.   :-2.1235  
##  1st Qu.:2.246   1st Qu.:-0.0871   1st Qu.:1.4822   1st Qu.:-1.1708  
##  Median :2.655   Median : 0.1118   Median :1.7759   Median :-0.9426  
##  Mean   :2.702   Mean   : 0.1240   Mean   :1.8081   Mean   :-0.9477  
##  3rd Qu.:3.076   3rd Qu.: 0.3197   3rd Qu.:2.0937   3rd Qu.:-0.6943  
##  Max.   :5.286   Max.   : 1.1699   Max.   :3.3043   Max.   : 0.2092  
##       V37                V38               V39              V40        
##  Min.   :-0.97981   Min.   :-0.3542   Min.   :-2.979   Min.   :0.3441  
##  1st Qu.:-0.13053   1st Qu.: 0.4123   1st Qu.:-1.850   1st Qu.:1.1226  
##  Median : 0.10106   Median : 0.7171   Median :-1.600   Median :1.4140  
##  Mean   : 0.09765   Mean   : 0.7028   Mean   :-1.597   Mean   :1.4253  
##  3rd Qu.: 0.31659   3rd Qu.: 0.9664   3rd Qu.:-1.292   3rd Qu.:1.6926  
##  Max.   : 1.11785   Max.   : 1.8466   Max.   :-0.459   Max.   :3.1198  
##       V41             V42              V43             V44          
##  Min.   :1.087   Min.   :0.5578   Min.   :1.172   Min.   :-0.71873  
##  1st Qu.:2.065   1st Qu.:1.2595   1st Qu.:2.389   1st Qu.: 0.05207  
##  Median :2.467   Median :1.6025   Median :2.728   Median : 0.23924  
##  Mean   :2.479   Mean   :1.5876   Mean   :2.786   Mean   : 0.25166  
##  3rd Qu.:2.829   3rd Qu.:1.8906   3rd Qu.:3.111   3rd Qu.: 0.44896  
##  Max.   :4.310   Max.   :3.1970   Max.   :5.451   Max.   : 1.28876  
##       V45               V46              V47               V48       
##  Min.   :-1.2524   Min.   :0.6458   Min.   :-0.5757   Min.   :1.160  
##  1st Qu.:-0.5622   1st Qu.:1.4902   1st Qu.: 0.4477   1st Qu.:2.098  
##  Median :-0.3427   Median :1.7671   Median : 0.6749   Median :2.423  
##  Mean   :-0.3435   Mean   :1.7748   Mean   : 0.6717   Mean   :2.462  
##  3rd Qu.:-0.1292   3rd Qu.:2.0273   3rd Qu.: 0.8764   3rd Qu.:2.769  
##  Max.   : 0.5476   Max.   :3.2006   Max.   : 1.8239   Max.   :4.117  
##       V49               V50       
##  Min.   :0.02562   Min.   :1.167  
##  1st Qu.:0.90468   1st Qu.:2.758  
##  Median :1.11844   Median :3.253  
##  Mean   :1.12614   Mean   :3.301  
##  3rd Qu.:1.33620   3rd Qu.:3.756  
##  Max.   :2.10368   Max.   :7.159

Yes, 50 markets. We asked for that!

summary( posterior_samples$a_bar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6089  1.1652  1.3467  1.3479  1.5075  2.2848
summary( posterior_samples$sigma)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.067   1.440   1.597   1.616   1.750   2.766

Before we calculate our estimated log-odds, let’s check our estimates for the population of parameters from which each intercept comes:

plt <- data.frame(alpha_bar = posterior_samples$a_bar) %>% 
  ggplot(aes(alpha_bar)) +
  geom_histogram(binwidth = 0.01, color = "darkblue", fill = "dodgerblue4", alpha = 0.7) +
  geom_vline(aes(xintercept = 1.5), linetype = 2, color = "red") +
  labs(title = "Posterior samples for population intercepts (alpha)")
ggplotly( plt )

It seems that we’ve correctly captured the mean of the population. The MAP is about 1.38. What about the standard deviation of the distribution?

plt <- data.frame(sigma = posterior_samples$sigma) %>% 
  ggplot(aes(sigma)) +
  geom_histogram(binwidth = 0.01, color = "darkblue", fill = "blue", alpha = 0.7) +
  geom_vline(aes(xintercept = 1.5), linetype = 2, color = "red") +
  labs(title = "Posterior samples for population standard deviation")
ggplotly( plt )

Our estimates for the variation in the population could be better. The MAP can land at either 1.62 or 1.67, such are the vagaries of MCMC. Let’s check our estimated probability of transacting for each market.

# first, our logistic function to convert scores into probabilities
logistic_ours <- function(z) {
  1 / (1 + exp(-z))
}
# yes, a matrix of estimated intercepts by market
matrix_estimated_probs <- logistic_ours(posterior_samples$a_market)
glimpse(matrix_estimated_probs)
##  num [1:500, 1:50] 0.921 0.904 0.632 0.884 0.816 ...

We have a matrix of 500 rows (500 simulations) and 50 columns (50 different markets). The column average across samples will estimate the probability of transacting in each market.

partial_pooling_estimates <- apply(matrix_estimated_probs, 2, mean)
data_simulation <- data.frame(
  estimated_probability_partial_pooling = partial_pooling_estimates) %>% 
  cbind(data_simulation)
glimpse(data_simulation)
## Rows: 50
## Columns: 6
## $ estimated_probability_partial_pooling <dbl> 0.9007274, 0.7873951, 0.90168...
## $ market                                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10...
## $ N_i                                   <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,...
## $ true_log_odds                         <dbl> 3.5564377, 0.6529527, 2.04469...
## $ number_transacted                     <int> 5, 4, 5, 5, 3, 2, 5, 3, 5, 5,...
## $ estimated_probability_no_pooling      <dbl> 1.00, 0.80, 1.00, 1.00, 0.60,...

Then we transform our true log-odds into true probabilities:

data_simulation <- data_simulation %>% 
  mutate(true_probabilities = inv_logit(true_log_odds)
         )
glimpse(data_simulation)
## Rows: 50
## Columns: 7
## $ estimated_probability_partial_pooling <dbl> 0.9007274, 0.7873951, 0.90168...
## $ market                                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10...
## $ N_i                                   <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,...
## $ true_log_odds                         <dbl> 3.5564377, 0.6529527, 2.04469...
## $ number_transacted                     <int> 5, 4, 5, 5, 3, 2, 5, 3, 5, 5,...
## $ estimated_probability_no_pooling      <dbl> 1.00, 0.80, 1.00, 1.00, 0.60,...
## $ true_probabilities                    <dbl> 0.97225163, 0.65767555, 0.885...

We are now ready to answer further questions about the efficacy of pooling and the resulting shrinkage.

Pooling and shrinkage made manifest

Pooling means sharing information across, in this simulation, markets. This is done by explicitly modeling the distribution of the average log-odds of transacting across markets. In this way our estimated mean for the distribution of intercepts will inform each of our market predictions. Let’s calculate this estimated global mean across markets.

estimated_global_mean <- data.frame(
  alpha_bar = posterior_samples$a_bar) %>% 
  mutate(alpha_bar = inv_logit(alpha_bar)) %>% 
  summarise(mean(alpha_bar))
estimated_global_mean <- estimated_global_mean[1,1]
glue::glue("The estimated global mean is: {round(estimated_global_mean, 2)}")
## The estimated global mean is: 0.79

We should view our handiwork to understand how market estimates relate to the estimated global mean.

levels_market <- c("Sample size in markets: 5",
                   "Sample size in markets: 10",
                   "Sample size in markets: 25",
                   "Sample size in markets: 35",
                   "Sample size in markets: 40"
                   ) 
plt_data <- data_simulation %>% 
  select(market, estimated_probability_partial_pooling, estimated_probability_no_pooling, N_i) %>% 
  pivot_longer(-c(market, N_i), names_to = "method", values_to = "estimated_probability") %>% 
  mutate(N_i = glue::glue("Sample size in markets: {N_i}"),
         N_i = factor(N_i, levels = levels_market))
plt <- plt_data %>% 
  ggplot(aes(market, estimated_probability, color = method)) +
  geom_point(alpha = 0.6) +
  geom_hline(aes(yintercept = estimated_global_mean), linetype = 2, color = "red") +
  facet_wrap(~N_i, scales = "free") +
  scale_color_viridis_d() +
  scale_y_continuous(labels = scales::percent) +
  theme(legend.position = "bottom") +
  labs(title = "Pooling and Shrinking in a Multilevel Model",
       subtitle = "Global estimated mean informs predictions for each market: shrinking estimates to the global estimated mean",
       caption = "Global estimated mean shown in red.")
ggplotly( plt )

Now we can see that, with partial pooling, our estimates are informed by the estimated global mean. We shrink whatever proportion we calculate for a specific market toward this overall mean. This can be seen by zooming in on the yellow points, the estimates from partial pooling, and noticing that they are always closer to the red line than the purple points, the raw sample market proportions. Pooling results in more aggressive shrinkage for the markets where we have fewer data. Will these market predictions be the ones that need to be shrunk the most?

The Benefits of Pooling and Shrinkage

To answer that last question we can compare how well the different models performed.

# let's recall the levels and their sample sizes
levels_market <- c("Sample size in markets: 5",
            "Sample size in markets: 10",
            "Sample size in markets: 25",
            "Sample size in markets: 35",
            "Sample size in markets: 40")
# we could use a paste0() across the sample sizes
plt_data <- data_simulation %>% 
  mutate(no_pooling_error = abs(estimated_probability_no_pooling - true_probabilities),
         partial_pooling_error = abs(estimated_probability_partial_pooling - true_probabilities)) %>% 
  select(market, no_pooling_error, partial_pooling_error, N_i) %>% 
  pivot_longer(-c(market, N_i), names_to = "method", values_to = "error") %>% 
  mutate(N_i = glue::glue("Sample size in markets: {N_i}"),
         N_i = factor(N_i, levels = levels_market)) 
plt <- plt_data %>% 
  ggplot(aes(error, factor(market), color = method)) +
    geom_point(alpha = 0.6) +
    scale_color_viridis_d() +
    facet_wrap(~N_i, scales = "free_y") +
    hrbrthemes::theme_ipsum_rc(grid = "Y") +
    theme(legend.position = "bottom") +
    labs(title = "Partial pooling vs. no-pooling: benefits of shrinkage",
         subtitle = "Low sample sizes and outliers benefit",
         y = "market")
ggplotly( plt )

This plot compares our estimated probability to the true probability we simulated across markets.

Some reflections are in order.

  1. Partial pooling shrinks the number of effective parameters and yields less complex models. This is most helpful for the markets where we have relatively few data (i.e., markets with sample sizes of 5 and 10). For these markets, we complement the little data that we have with the information pooled from larger markets. In this way we shrink our estimates toward the population mean that we’ve estimated across all markets. The model with no pooling uses only the information in the low-sample markets, resulting in overfitting that shows itself in the plot in the form of large prediction errors. The comparison between the two methods shows us how shrinkage can reduce overfitting and thus predict more reliably out of sample.

  2. The amount of shrinkage depends on the amount of data available. When we have little data, we shrink a lot. When we have lots of data, we still shrink, but much less. For markets with lots of data (e.g., a sample size of 35), partial pooling yields an almost identical prediction to the method with no pooling.

  3. The model with no pooling can sometimes beat the model with partial pooling. On average, however, the model with partial pooling will perform much better.